Nature of the concept. Intrinsic dimensionality of data does not refer to a single well-defined
parameter, but rather to a diverse family of parameters associated with a given similarity workload
$(X, d, U)$, which allow one to gauge the performance of various indexing schemes for similarity-based
information retrieval.

Perhaps the best known single feature of a high-dimensional dataset $U$ is "the empty space
paradox": the average distance, $E\varepsilon_{NN}(X)$, from a query point to its nearest neighbour in $U$
grows along with the dimension of $U$. Asymptotically, $E\varepsilon_{NN}(X)$ approaches the average distance
$E[d(X, Y)]$ between two points of $X$ (the characteristic size of the dataset). This is a consequence of
the fact that the distance functions $d_x(y) = d(x, y)$, $x \in X$, concentrate near their median/mean
values. The boxplot diagram (a) shows (normalized) pairwise distance distributions of 100 uniformly
sampled points in the cube $I^d$ for various values of $d$. The lines in the middle of the boxes
are median values, the lower and upper sides denote the first and third quartiles respectively, and
the small circles mark outliers. For high $d$, even the outlier distance values approach the median.
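The effect shown in diagram (a) can be reproduced numerically. The following is a minimal sketch, not the computation used for the figure: it assumes the Euclidean metric on the unit cube, and the helper names `pairwise_distances` and `relative_spread`, the seed, and the chosen dimensions are all illustrative. It reports the interquartile range of the pairwise distances divided by their median, which should shrink as the dimension grows.

```python
import math
import random
import statistics

def pairwise_distances(points):
    """Euclidean distances between all unordered pairs of points."""
    return [math.dist(p, q)
            for i, p in enumerate(points)
            for q in points[i + 1:]]

def relative_spread(dim, n_points=100, seed=0):
    """Interquartile range of the pairwise distances divided by their median."""
    rng = random.Random(seed)
    points = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = pairwise_distances(points)
    q1, median, q3 = statistics.quantiles(dists, n=4)
    return (q3 - q1) / median

# Assumed dimensions, chosen only for illustration.
spreads = {dim: relative_spread(dim) for dim in (2, 10, 100, 1000)}
for dim, spread in spreads.items():
    print(f"d = {dim:4d}   IQR / median = {spread:.3f}")
```

Dividing by the median plays the role of the normalization mentioned above: the absolute spread of the distances stays roughly constant, while the characteristic size grows with the dimension.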



[Figure: (a) distance distribution in cubes; (b) dimension of Chavez et al., n = 3000]

Generally, the intrinsic dimensionality is inversely proportional to some dispersion parameter
of data, usually normalized with regard to the characteristic size of U. The possible choices of a 
dispersion parameter are many, and they reflect different aspects of the dataset and its indexability. 
Illustration: the intrinsic dimensionality of Chavez et al. These authors have proposed
and studied a parameter which we denote by the acronym of their last names: $\dim_{CNBYM}(X) =
E(d)^2 / (2\,\mathrm{var}(d))$. Here the distance $d$ is treated as a random variable on $X \times X$. The corresponding
statistical parameter of a dataset $U$ has a number of advantages: it is easy to compute, and it makes a good
statistical estimator for the dimension of the domain $X$. The robustness of the "CNBYM dimension"
can be seen from diag. (b), where the parameter is calculated for independent random samples of
$n = 3{,}000$ points drawn from the gaussian distribution $\gamma^d$ on $\mathbb{R}^d$, $1 \le d \le 50$.
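To make the formula concrete, here is a small sketch that estimates $E(d)^2 / (2\,\mathrm{var}(d))$ from all pairwise Euclidean distances of a gaussian sample. It is an illustration under stated assumptions rather than the procedure behind diagram (b): the function names, the seed, and the sample size of 300 points (instead of 3,000, to keep the run short) are our own choices.

```python
import math
import random
import statistics

def cnbym_dimension(points):
    """E(d)^2 / (2 var(d)), with d ranging over all pairwise distances."""
    dists = [math.dist(p, q)
             for i, p in enumerate(points)
             for q in points[i + 1:]]
    mean = statistics.fmean(dists)
    variance = statistics.fmean([(x - mean) ** 2 for x in dists])
    return mean ** 2 / (2 * variance)

def gaussian_sample(dim, n_points, rng):
    """n_points independent standard gaussian vectors in R^dim."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_points)]

rng = random.Random(1)
# Assumed sample size (300) and dimensions, chosen only for illustration.
estimates = {dim: cnbym_dimension(gaussian_sample(dim, 300, rng))
             for dim in (1, 5, 20, 50)}
for dim, value in estimates.items():
    print(f"d = {dim:2d}   estimated CNBYM dimension = {value:.2f}")
```

For a gaussian sample the estimate should track the dimension of the domain, which is the behaviour the text attributes to this parameter.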

The curse of dimensionality. When applied to high-dimensional datasets, many similarity-based
information retrieval algorithms slow down to a point where they cannot outperform a simple sequential
scan. Even though the phenomenon is well known to data practitioners, curiously enough
there is still no mathematical validation of it, and one of the major theoretical challenges in the field
is the so-called Curse of Dimensionality Conjecture in the bare-bones setting where $X = \{0, 1\}^d$ is
a Hamming cube. Rigorous lower bounds have been obtained only in a small number of cases, some
of which we will describe below.

The concentration phenomenon. The curse of dimensionality is closely linked to the phenomenon
of concentration of measure (cf. e.g. [3, Ch. 3½]): on a typical geometric object $X$ of
high dimension, not only the distance functions but all 1-Lipschitz functions $f : X \to \mathbb{R}$ concentrate
near their median. (A function is 1-Lipschitz if it does not increase distances: $|f(x) - f(y)| \le
d(x, y)$.) The concentration function $\alpha_X(\varepsilon)$ of $X$ gives an upper bound on the probability that the
value of a 1-Lipschitz function $f$ at an $x \in X$ deviates from its median by more than $\varepsilon > 0$. For
instance, the concentration function of the $d$-dimensional Hamming cube $\{0, 1\}^d$ with the normalized
Hamming metric and uniform measure satisfies a Chernoff bound $\alpha(\varepsilon) \le \exp(-2\varepsilon^2 d)$, cf.
diagram (c). Similar gaussian bounds exist for Euclidean spheres, cubes, and a variety of other
objects. The higher the dimension of $X$, the more sharply $\alpha_X(\varepsilon)$ drops off near zero.

[Figure: (c) concentration function of the Hamming cube, d = 200; (d) function f concentrates]
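The Chernoff bound can be checked on one particular 1-Lipschitz function. The sketch below is an illustration, not a computation of the concentration function itself: it takes $w(x)$, the proportion of ones in $x \in \{0,1\}^d$, which is 1-Lipschitz for the normalized Hamming metric and has median $1/2$ when $d$ is even, and compares its exact upper tail with $\exp(-2\varepsilon^2 d)$. Only the one-sided deviation is examined, since that is what Hoeffding's inequality controls directly; the function names and the grid of $\varepsilon$ values are our own.

```python
import math

def upper_tail(d, eps):
    """Exact P(w(x) - 1/2 > eps) for x uniform in {0,1}^d,
    where w(x) = (number of ones in x) / d."""
    threshold = d * (0.5 + eps)
    favourable = sum(math.comb(d, k) for k in range(d + 1) if k > threshold)
    return favourable / 2 ** d

def chernoff_bound(d, eps):
    """The gaussian-type bound exp(-2 eps^2 d) quoted in the text."""
    return math.exp(-2 * eps ** 2 * d)

d = 200  # same dimension as in diagram (c)
rows = [(eps, upper_tail(d, eps), chernoff_bound(d, eps))
        for eps in (0.0, 0.05, 0.1, 0.15, 0.2)]
for eps, tail, bound in rows:
    print(f"eps = {eps:.2f}   exact tail = {tail:.3e}   bound = {bound:.3e}")
```

The exact tail stays below the bound for every $\varepsilon$ on the grid and drops off quickly, in line with the shape described for diagram (c).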

Degrading performance of indexing schemes. Many known access methods for exact similarity
search (including pivot-based schemes, metric trees, etc.) have the following structure. An indexing
scheme consists of a collection of 1-Lipschitz functions $\{f_\alpha : \alpha \in A\}$ on the domain $X$ (possibly
partially defined). Given a query point $x \in X$ and the range value $\varepsilon > 0$, calculations of the
values $f_{\alpha_i}(x)$, $i = 1, 2, \ldots$, are performed in a recursive way, where the function $f_{\alpha_{i+1}}$ is chosen
depending on the previous values $f_{\alpha_1}(x), \ldots, f_{\alpha_i}(x)$. If for some $i$ one has $|f_{\alpha_i}(y) - f_{\alpha_i}(x)| > \varepsilon$,
the datapoint $y$ cannot possibly belong to the $\varepsilon$-ball around $x$ and so is irrelevant. Datapoints
$x_1, \ldots, x_k \in U$ whose irrelevance cannot be certified in this way are returned, and each of them is
checked against the condition $d(x, x_i) \le \varepsilon$.
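The discard-then-verify structure just described can be sketched for the simplest case, a pivot table. This is a minimal illustration under our own assumptions: the 1-Lipschitz functions are distances to a few fixed pivots, all of them are evaluated for every query (the adaptive, recursive choice of the next function is not modelled), and the names `build_pivot_table` and `range_query`, the data, and the pivot choice are hypothetical.

```python
import math
import random

def build_pivot_table(data, pivots, dist):
    """Precompute f_p(y) = dist(p, y) for every pivot p and datapoint y."""
    return [[dist(p, y) for p in pivots] for y in data]

def range_query(query, eps, data, pivots, table, dist):
    """Return (answers, n_candidates) for the eps-ball around the query."""
    query_values = [dist(p, query) for p in pivots]
    answers, n_candidates = [], 0
    for y, y_values in zip(data, table):
        # |f_p(y) - f_p(x)| > eps certifies d(x, y) > eps (triangle inequality).
        if any(abs(fy - fx) > eps for fy, fx in zip(y_values, query_values)):
            continue
        # Irrelevance could not be certified: check the point directly.
        n_candidates += 1
        if dist(query, y) <= eps:
            answers.append(y)
    return answers, n_candidates

rng = random.Random(2)
data = [(rng.random(), rng.random()) for _ in range(2000)]
pivots = data[:4]  # hypothetical pivot choice: the first four datapoints
table = build_pivot_table(data, pivots, math.dist)
query, eps = (0.5, 0.5), 0.05
answers, n_candidates = range_query(query, eps, data, pivots, table, math.dist)
brute_force = [y for y in data if math.dist(query, y) <= eps]
print(len(answers), "answers among", n_candidates, "candidates out of", len(data))
```

The number of candidates that survive the filter is the cost that matters here: every surviving point requires a direct distance computation against the query.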

If the domain $X$ has high concentration dimension, then most values of each function $f_\alpha$ concentrate
near the median. Now denote by $\mathcal{F}$ the class of all Lipschitz functions used to construct
indexing schemes of a given type (e.g. the class of all distance functions for pivot schemes). Assume
that $\mathcal{F}$ has a small capacity in the sense of statistical learning theory. (This condition is very
natural and is satisfied, e.g., by distance functions on the Euclidean spaces or Hamming cubes.) Assume
further that the points of $U$ are sampled from $X$ in an i.i.d. fashion. Then the values of each
$f \in \mathcal{F}$ at datapoints can be shown to concentrate as well, and only a small proportion of points
can be discarded, cf. diag. (d). This argument leads to lower bounds on the performance of pivot
indexing schemes which are linear in the size $|U|$ of the dataset. Similar results can be deduced for
metric trees.
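A small simulation illustrates how the proportion of discarded points shrinks with dimension. It is only a sketch and not part of the argument above: it assumes uniformly distributed points in the unit cube, a single pivot, queries drawn from the same distribution, and a query radius fixed at 20% of the mean distance to the pivot. The function name `discarded_fraction` and all parameter values are our own choices.

```python
import math
import random
import statistics

def discarded_fraction(dim, n_points=500, n_queries=20, rel_radius=0.2, seed=3):
    """Average fraction of datapoints that one pivot certifies as irrelevant."""
    rng = random.Random(seed)

    def sample():
        return [rng.random() for _ in range(dim)]

    data = [sample() for _ in range(n_points)]
    pivot = sample()
    to_pivot = [math.dist(pivot, y) for y in data]
    # Assumption: the radius is a fixed fraction of the characteristic size.
    eps = rel_radius * statistics.fmean(to_pivot)
    total = 0
    for _ in range(n_queries):
        fq = math.dist(pivot, sample())
        total += sum(abs(fy - fq) > eps for fy in to_pivot)
    return total / (n_queries * n_points)

fractions = {dim: discarded_fraction(dim) for dim in (2, 20, 200)}
for dim, fraction in fractions.items():
    print(f"d = {dim:3d}   fraction discarded by one pivot = {fraction:.3f}")
```

In low dimension a single pivot eliminates a large share of the dataset, whereas in high dimension the values of the distance function at the datapoints and at the query are nearly equal, so almost nothing can be discarded.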

One is naturally led to the concentration dimension of $X$, defined by $\dim_\alpha(X) = 1 \big/ \bigl(2\int_0^1 \alpha_X(\varepsilon)\, d\varepsilon\bigr)$
and investigated in [7]. Notice, however, that while $\dim_\alpha(X) \ge \dim_\alpha(U)$, the relationship between
the two is not understood: the latter value (dimension of the empirical measure) need not be a good
statistical estimator for the former.

Concentration versus complexity. Heuristically, indexability of a workload $(X, d, U)$ amounts to
the existence of sufficiently many 1-Lipschitz functions $f$ on $U$ which at the same time dissipate (as
opposed to concentrate) and have a low computational complexity. Thus, indexing means finding
the right balance between concentration and complexity.

Here is an example of a "success story" where smallness of values of an intrinsic dimension 
guarantees the existence of an efficient access method. 

Assouad dimension. The Assouad (doubling) dimension of a metric space $X$ is the minimum value
$p \ge 0$ such that every set $A$ in $X$ can be covered by $2^p$ balls of half the diameter. (The diameter of a
set $A$ is the supremum of $d(x, y)$, $x, y \in A$.) To appreciate that $\dim_{Assouad}(X)$ is indeed an inverse
quantity to some parameter of dispersion of data, notice that if $X$ is equipped with a probability
measure and $\dim_{CNBYM}(X)$ is high, then the size of balls as a function of radius grows fast in the
vicinity of $E(d)$, and so $\dim_{Assouad}(X)$ has to be large.

As shown in [5], a low Assouad dimension of a workload allows for the construction of a
simple and efficient indexing scheme which is essentially a metric tree. The quantity $2^p$ bounds
from above the number of leaf nodes (degree) of such a tree.
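The covering condition in the definition can be illustrated on a finite subset of $\mathbb{R}^n$. The sketch below rests on assumptions that are ours and not the source's: the metric is $\ell_\infty$, and "balls of half the diameter" is read as $\ell_\infty$ balls (cubes) whose diameter is half that of the set. It splits the set at the midpoint of each coordinate, which gives at most $2^n$ parts, each of at most half the original diameter; this suggests a doubling dimension of at most $n$ in this setting. The helper names are hypothetical.

```python
import itertools
import random

def linf(p, q):
    """l_infinity distance between two points."""
    return max(abs(a - b) for a, b in zip(p, q))

def diameter(points):
    """Largest pairwise l_infinity distance (0.0 for fewer than two points)."""
    return max((linf(p, q) for p, q in itertools.combinations(points, 2)),
               default=0.0)

def halve(points):
    """Split a finite set in (R^n, l_inf) into at most 2^n parts,
    each of diameter at most half the diameter of the whole set."""
    n = len(points[0])
    lows = [min(p[i] for p in points) for i in range(n)]
    half = diameter(points) / 2
    cells = {}
    for p in points:
        # One bit per coordinate: lower or upper half of a cube of side 2 * half.
        key = tuple(p[i] >= lows[i] + half for i in range(n))
        cells.setdefault(key, []).append(p)
    return list(cells.values())

rng = random.Random(4)
n = 3
points = [tuple(rng.random() for _ in range(n)) for _ in range(200)]
parts = halve(points)
print(len(parts), "parts, bound 2^n =", 2 ** n)
```

Applying `halve` recursively to each part yields a tree whose branching is bounded by $2^n$, which is the kind of structure the bound on the degree above refers to.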

Concluding remarks and problems.

• A thorough survey of intrinsic dimensionality in the context of similarity search is [2], though
even this authoritative source is not comprehensive.

• It is safe to predict that the mainstream research direction in the field will be the discovery of
further "dimensionality parameters of positive type", such as the Assouad dimension. More and more
real datasets that currently appear high-dimensional will turn out to have low, tractable intrinsic
dimensions.



• (Ciaccia, Pestov). Gromov's reconstruction theorem ([3], p. 120) says that a metric space with
measure can be recovered from distribution laws of all n-point metric subspaces, for all n. (Compare
with the CNBYM dimension, determined by the distribution law of 2-point subspaces.) How
much can one infer about indexability from the distribution law of 3-point subspaces? 

• Notice that fractal-type dimensions (cf. [8]) fit into the general paradigm of intrinsic dimensionality
outlined by us at the beginning: they reflect the rate of growth of balls/boxes. So does another proposed
version of intrinsic dimension, the disorder dimension, whose aim is to quantify the situation
where $E(\varepsilon_{NN})$ is close in value to $E(d)$.

• We assumed above that queries follow the same underlying distribution as datapoints. Now
suppose we have two underlying measures, $\mu$ modelling data and $\nu$ modelling query centers. Can
one build a theory of concentration and intrinsic dimension for such metric bi-measure spaces?